About the Dataset (Summary)

The Used Car Price Prediction dataset contains 4,009 vehicle listings collected from the automotive marketplace cars.com. Each row represents one car and has 12 columns covering price and vehicle characteristics. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset

The dataset provides information on:

Brand and model – manufacturer and specific vehicle model

Model year – age of the car, influencing depreciation

Mileage – an indicator of usage and wear

Fuel type – e.g., gasoline, diesel, electric, hybrid

Engine type – performance and efficiency characteristics

Transmission – automatic or manual

Exterior/interior colors – aesthetic properties

Accident history – whether the car has previously been damaged

Clean title – legal/ownership status

Price – listed price of the vehicle

Overall, the dataset offers a structured overview of key features that influence used car valuation. It is well-suited for analytical tasks such as understanding pricing drivers, exploring consumer preferences, and building predictive models for vehicle prices.

Raw data

We load the original CSV directly from the project data folder using here() so paths work regardless of the working directory.

raw_path <- here("data", "raw", "used_cars.csv")
cars_raw <- readr::read_csv(raw_path, show_col_types = FALSE)

Basic structure and summary statistics of the raw dataset:

glimpse(cars_raw)
## Rows: 4,009
## Columns: 12
## $ brand        <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", ~
## $ model        <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35~
## $ model_year   <dbl> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202~
## $ milage       <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "~
## $ fuel_type    <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol~
## $ engine       <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "~
## $ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed~
## $ ext_col      <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi~
## $ int_col      <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl~
## $ accident     <chr> "At least 1 accident or damage reported", "At least 1 acc~
## $ clean_title  <chr> "Yes", "Yes", NA, "Yes", NA, NA, "Yes", "Yes", "Yes", "Ye~
## $ price        <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$~
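
The glimpse() output shows that milage and price arrive as character strings ("51,000 mi.", "$10,300"), so they have to be parsed before any numeric analysis. A minimal sketch of that step in base R, assuming the formats seen above hold for every row. The helper name to_number is hypothetical, and the project's actual cleaning code is not shown in this report.

```r
# Hypothetical helper: drop every non-digit character, then convert.
# Assumes whole-number values such as "$10,300" or "51,000 mi.".
to_number <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

# Sample values copied from the glimpse() output above
price_chr  <- c("$10,300", "$38,005", "$54,598")
milage_chr <- c("51,000 mi.", "34,742 mi.", "22,372 mi.")

price_dollar <- to_number(price_chr)
milage_k     <- to_number(milage_chr) / 1000  # assumed definition: miles / 1000

print(price_dollar)
print(milage_k)
```

readr::parse_number() does a similar job and may be the more natural choice in a readr-based pipeline.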

Exploratory Data Analysis

We base the EDA on the engineered dataset (data/processed/used_cars_features.csv) that keeps cleaned numeric fields and derived features like age, mileage in thousands, and accident flags.

features_path <- here("data", "processed", "used_cars_features.csv")
cars <- readr::read_delim(features_path, delim = ";", show_col_types = FALSE)
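
The feature-engineering script itself is not part of this report. The sketch below shows one way the derived columns could be built, under several assumptions: the reference year for age is a placeholder, horsepower is taken from the "...HP" prefix of the engine string, and log_price is the natural log of the dollar price. The toy rows are illustrative, not real listings.

```r
# Illustrative rows loosely based on the glimpse() output; not real listings
toy <- data.frame(
  model_year   = c(2013, 2021, 2018),
  price_dollar = c(10300, 38005, 54598),
  engine       = c(
    "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability",
    "250.0HP 2.0L 4 Cylinder Engine",
    "Electric Motor"
  ),
  accident     = c(
    "At least 1 accident or damage reported",
    "None reported",
    "None reported"
  ),
  stringsAsFactors = FALSE
)

reference_year <- 2024  # assumption: the real reference year is not stated

toy$age           <- reference_year - toy$model_year
toy$log_price     <- log(toy$price_dollar)  # natural log
toy$accident_flag <- toy$accident == "At least 1 accident or damage reported"

# Extract the number directly before "HP"; rows without it stay NA
toy$horsepower <- NA_real_
has_hp <- grepl("[0-9.]+HP", toy$engine)
toy$horsepower[has_hp] <- as.numeric(
  sub("^.*?([0-9.]+)HP.*$", "\\1", toy$engine[has_hp], perl = TRUE)
)

print(toy[, c("age", "log_price", "accident_flag", "horsepower")])
```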

Key descriptive values

Key numeric feature summaries

variable        median      mean       p25       p75        sd     min        max
price_dollar  28000.00  36865.68  15500.00  46999.00  36531.16  2000.0  649999.00
log_price        10.24     10.19      9.65     10.76      0.82     7.6      13.38
age               9.00     10.32      6.00     14.00      5.87     1.0      29.00
milage_k         63.00     72.14     30.00    103.00     53.60     0.0     405.00
horsepower      310.00    331.51    248.00    400.00    120.32    76.0    1020.00

Accident history distribution

accident                                   n  share
At least 1 accident or damage reported   871   0.28
None reported                           2194   0.72
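
The code that produced the numeric summary is not shown. A small base R sketch of how the same seven statistics could be computed for one column follows; the function name summarise_numeric and the input vector are hypothetical, and quantile() is used with its default interpolation, which may differ from the original.

```r
# Hypothetical helper returning the statistics used in the summary table
summarise_numeric <- function(x) {
  c(
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    p25    = unname(quantile(x, 0.25, na.rm = TRUE)),
    p75    = unname(quantile(x, 0.75, na.rm = TRUE)),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
}

# Made-up values, only to show the shape of the result
x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
stats <- summarise_numeric(x)
print(round(stats, 2))
```

Applying the helper to each numeric column (for example with sapply() and then t()) would give one row per variable, as in the table.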

The median listing price is about $28k, with the middle 50% between roughly $15.5k and $47k. The maximum reaches $650k and the mean ($36.9k) sits well above the median, both signs of a heavy right tail. Median age is 9 years (IQR: 6–14), median mileage is about 63k miles (IQR: 30k–103k), and median horsepower is 310 HP (IQR: 248–400). About 28% of cars report an accident or damage, a meaningful factor for pricing.

Price distribution (raw and log)

Raw prices are extremely right-skewed, with most listings below $80k but a long tail of luxury and exotic vehicles. Modeling on this scale would be dominated by a few high-price outliers.

Log transformation produces a more bell-shaped distribution and stabilizes variance, making linear-style models and visual comparisons more reliable.
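
The compression can be checked against the summary table. The sketch below assumes log_price is the natural logarithm of price_dollar, which matches the reported minimum, median, and maximum.

```r
# Values taken from the price_dollar row of the summary table
price_summary <- c(min = 2000, median = 28000, max = 649999)

log_summary <- log(price_summary)  # natural log
print(round(log_summary, 2))

# On the raw scale the maximum is about 325 times the minimum;
# on the log scale the same gap is under 6 units.
raw_ratio <- price_summary[["max"]] / price_summary[["min"]]
log_range <- log_summary[["max"]] - log_summary[["min"]]
print(c(raw_ratio = raw_ratio, log_range = log_range))

# exp() maps log-scale values back to dollars
back_transformed <- exp(log_summary)
```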

Depreciation by age and fuel type

Prices decline with age across fuel types. Electric listings start high but show the sharpest early drop; diesel holds comparatively high prices across ages (though the diesel sample is small), and gasoline sits lower overall.

Price spread across top brands

Among the 12 most common brands, Porsche leads on median price, followed by Land Rover and Mercedes-Benz. Volume brands (Toyota, Nissan, Jeep) cluster lower with tighter spreads, while some (Chevrolet, Ford) span broader lineups.

Mileage impact by transmission

Higher mileage correlates with lower prices. We use a loess smoother (not a straight trendline) and cap the x-axis at 250k miles to reduce the influence of extreme outliers; automatics show a steady decline, and the smaller manual subset is noisier but similar in direction.
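
A minimal sketch of the smoothing idea using stats::loess() on synthetic data. It assumes the 250k cap is applied by dropping rows before fitting, which the report does not state explicitly; the simulated relationship, seed, and span are placeholders.

```r
set.seed(42)

# Synthetic data: log price falls with mileage, plus noise (not real listings)
n <- 300
sim <- data.frame(milage_k = runif(n, 0, 400))
sim$log_price <- 11 - 0.008 * sim$milage_k + rnorm(n, sd = 0.3)

# Cap at 250k miles before fitting so extreme mileages do not steer the curve
capped <- subset(sim, milage_k <= 250)

# Local regression; span controls how much of the data each local fit uses
fit <- loess(log_price ~ milage_k, data = capped, span = 0.75)

# Read the smoothed curve at a few mileages
grid <- data.frame(milage_k = c(20, 100, 200))
smoothed <- predict(fit, newdata = grid)
print(round(smoothed, 2))
```

Fitting separate curves per transmission type would repeat the same steps on each subset.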

Horsepower premium

Price rises with horsepower, especially from ~250 HP upward. We cap horsepower at 700 to prevent a handful of ultra-high-HP outliers from distorting the loess smoother, so the trend reflects the bulk of the market rather than extreme sports models.

Accident history effect

Cars with reported accidents are listed at a clear discount relative to cars with clean histories, even after log-scaling prices, which supports treating accident history as an important predictor.
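
A gap in median log price converts to a percentage discount via exp(gap) - 1. The sketch below shows the arithmetic on made-up prices; the numbers are illustrative only and are not estimates from this dataset.

```r
# Made-up prices, three per group, purely to show the calculation
toy <- data.frame(
  accident = c(
    rep("None reported", 3),
    rep("At least 1 accident or damage reported", 3)
  ),
  price_dollar = c(30000, 42000, 25000, 24000, 33000, 19000)
)
toy$log_price <- log(toy$price_dollar)

# Median log price per accident group
med <- tapply(toy$log_price, toy$accident, median)

# Difference on the log scale, then back to a percentage
log_gap <- med[["At least 1 accident or damage reported"]] - med[["None reported"]]
pct_discount <- (exp(log_gap) - 1) * 100
print(round(pct_discount, 1))
```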

SVM model results

We fit radial-kernel SVM regressors on log_price using both e1071::svm and kernlab (via caret::train). Both models use the same 80/20 train-test split; hyperparameters are tuned by cross-validation and evaluated on the hold-out test set below.

svm_metrics <- readr::read_csv(
  here("report", "models", "svm", "svm_log_price_metrics.csv"),
  show_col_types = FALSE
)

svm_best <- readr::read_lines(here("report", "models", "svm", "svm_best_model.txt"))[1]

svm_metrics_wide <- svm_metrics |>
  tidyr::pivot_wider(names_from = .metric, values_from = .estimate)

knitr::kable(svm_metrics_wide, digits = 3, caption = "Test metrics for SVM variants (target: log_price)")

Test metrics for SVM variants (target: log_price)

.estimator  model            rmse    mae    rsq
standard    e1071_radial    0.290  0.209  0.868
standard    kernlab_radial  0.354  0.251  0.803

The e1071 radial SVM is the better model on all three test metrics (RMSE ≈ 0.290, MAE ≈ 0.209, R² ≈ 0.868), outperforming the kernlab variant on this split. SVMs do not yield straightforward coefficient interpretations: they learn support vectors and a decision function in a transformed feature space. Understanding feature effects would require downstream tools (e.g., partial dependence or SHAP). This report focuses on comparative error metrics and notes that the tuned radial kernel can capture non-linear relationships beyond the linear/log models.
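
For reference, the three metrics can be written out in base R. This is a sketch, assuming rsq is the squared correlation between observed and predicted values (the yardstick convention suggested by the .metric/.estimate column names); the helper name eval_metrics and the toy vectors are hypothetical. The last lines give a rough reading of the reported RMSE values: because the target is log_price, an error of r log units corresponds to a multiplicative factor of about exp(r) on the dollar scale.

```r
# Hypothetical helper: test metrics for a numeric target
eval_metrics <- function(truth, estimate) {
  c(
    rmse = sqrt(mean((truth - estimate)^2)),
    mae  = mean(abs(truth - estimate)),
    rsq  = cor(truth, estimate)^2  # assumption: squared correlation
  )
}

# Made-up log prices and predictions, only to exercise the function
truth    <- c(9.2, 10.1, 10.5, 11.0)
estimate <- c(9.5, 10.0, 10.4, 11.3)
m <- eval_metrics(truth, estimate)
print(round(m, 3))

# Rough dollar-scale reading of the reported log-scale RMSEs
rmse_log <- c(e1071_radial = 0.290, kernlab_radial = 0.354)
typical_factor <- exp(rmse_log)
print(round(typical_factor, 2))
```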

Cross-validation setup: both SVMs were tuned with 3-fold cross-validation on the training set (same 80/20 split for both). e1071::tune() searched a compact grid of cost and gamma; caret::train(method = "svmRadial") searched a grid of C and sigma. The final metrics shown above come from the untouched test set; cross-validation was used only for hyperparameter selection.
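
The e1071 side of this setup could look roughly like the sketch below. It runs on synthetic data with two predictors; the grid values, seed, and predictor set are placeholders, since the report does not list the ones actually used.

```r
library(e1071)

set.seed(123)

# Synthetic stand-in for the engineered dataset (not real listings)
n <- 200
sim <- data.frame(age = runif(n, 1, 29), milage_k = runif(n, 0, 250))
sim$log_price <- 11 - 0.05 * sim$age - 0.004 * sim$milage_k + rnorm(n, sd = 0.2)

# 80/20 train-test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Grid search over cost and gamma with 3-fold cross-validation on the training set
tuned <- tune(
  svm, log_price ~ age + milage_k,
  data = train,
  kernel = "radial",
  ranges = list(cost = c(1, 10), gamma = c(0.1, 1)),  # placeholder grid
  tunecontrol = tune.control(sampling = "cross", cross = 3)
)
print(tuned$best.parameters)

# Evaluate the selected model once on the held-out test set
pred <- predict(tuned$best.model, newdata = test)
test_rmse <- sqrt(mean((test$log_price - pred)^2))
print(round(test_rmse, 3))
```

Keeping the test rows out of tune() is what makes the final RMSE an honest hold-out estimate.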

Best model: e1071_radial with RMSE = 0.2898 (lower is better)